Skip to content

feat: consolidate image and bundle caches into LocalDB#790

Open
pingsutw wants to merge 10 commits intomainfrom
make-it-faster
Open

feat: consolidate image and bundle caches into LocalDB#790
pingsutw wants to merge 10 commits intomainfrom
make-it-faster

Conversation

@pingsutw
Copy link
Member

@pingsutw pingsutw commented Mar 13, 2026

Summary

  • Adds a persistent SQLite disk cache for image existence checks and code bundle uploads, consolidated into the shared LocalDB database (~/.flyte/local-cache/cache.db)
  • Image cache: after the first successful remote registry check (~4s for docker manifest inspect), subsequent runs hit the local SQLite cache (~0ms). Cache key is a SHA-256 of repository, tag, and architecture.
  • Bundle cache: if list_files_to_bundle returns the same digest as a previous run, the upload is skipped entirely and the cached remote path is reused. The cache check happens before create_bundle so tar creation is also avoided on cache hit.
  • Both caches have a 30-day TTL with probabilistic pruning (~5% of reads) to avoid stale entries without adding overhead on every access.
  • PersistentCacheImageChecker is inserted as the first checker in the DockerImageBuilder checker chain, before network-based checkers.

Key changes

  • _persistence/_db.py: Added image_cache and bundle_cache table DDLs to LocalDB, with a unified _ALL_TABLE_DDLS list for DRY initialization across sync and async paths
  • image_builder.py: Removed standalone ~/.flyte/cache/images.db — reads/writes now go through LocalDB.get_sync() with _write_lock for thread safety
  • bundle.py: Removed standalone ~/.flyte/cache/bundles.db — same consolidation pattern; uses lazy import of LocalDB to avoid circular imports
  • _task_cache.py: Added missing _write_lock to _set_sync and clear for consistency with all other sync write paths

Test plan

  • test_cached — verifies persistent cache is checked first and other checkers are skipped
  • test_persistent_cache_write_and_read — verifies write/read round-trip and arch isolation via SQLite
  • All existing image builder tests pass
  • Manual test: LOG_LEVEL=10 python examples/basics/hello.py — first run writes cache, second run shows "found in persistent cache" in ~0ms

🤖 Generated with Claude Code

pingsutw and others added 2 commits March 13, 2026 00:24
Adds PersistentCacheImageChecker that caches verified image URIs to
~/.flyte/cache/images/. After the first successful remote manifest
check (~4s), subsequent runs read from disk cache (~0ms), eliminating
repeated network round-trips to the container registry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
logger.debug(f"Image {image_uri} in registry")
return image_uri
# Persist to disk so future process invocations skip network checks
if checker is not PersistentCacheImageChecker:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the problem is the persistent cache never invalidates. If an image is deleted from the registry, the cache will still say it exists, and the build will be skipped

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it makes UX way more better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can solve this, by just having a very short TTL on the cache. this is why i am suggesting using sqlite. Anyways the data is tiny and one row is enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short TTL sounds good to me. I'll update it to use sqlite

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Add TTL and cache for code bundle.

Signed-off-by: Kevin Su <pingsutw@apache.org>
@kumare3
Copy link
Contributor

kumare3 commented Mar 14, 2026

@pingsutw this is great, i was thinking of adding this and even code bundle hash to the local cache. Shall we just use the sqlite table to store this?

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw changed the title feat: add persistent disk cache for image existence checks feat: add SQLite disk cache for image checks and code bundle uploads Mar 17, 2026
Signed-off-by: Kevin Su <pingsutw@apache.org>
@kumare3
Copy link
Contributor

kumare3 commented Mar 18, 2026

@pingsutw I think we should just use one sqlite db. can you please use local persistence for this. This way it will be easier to cache this, clear it etc

Signed-off-by: Kevin Su <pingsutw@apache.org>
@pingsutw pingsutw changed the title feat: add SQLite disk cache for image checks and code bundle uploads feat: consolidate image and bundle caches into LocalDB Mar 20, 2026
"""Look up a previously uploaded bundle by its file digest. Returns (hash_digest, remote_path) or None."""
from flyte._persistence._db import LocalDB

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not crash if database not found

print_ls_tree(from_dir, files)

# Check persistent cache before creating the tar bundle to avoid unnecessary work
if not dryrun:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run the stress/runs_per_second to ensure it does not break

print_ls_tree(from_dir, files)

# Check persistent cache before creating the tar bundle to avoid unnecessary work
if not dryrun:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have a cache in _run.py. should we just add it there instead of here? it feels weird that this is using sqlite?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we will have to then repeat it for deploy and serve i guess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants